1 Introduction

1.1 About This Lab

In January 2025, devastating wildfires swept through Los Angeles, becoming some of the most destructive in California’s history. (TIME Magazine, 2025) reported that climate change has “exacerbated the size, intensity, and damage caused by wildfires,” and experts noted that “the key question for living with wildfire is how we as humans manage the risks”.

In this lab, you will analyze real California Wildfire data to investigate these claims. You will explore historical patterns, examine relationships between climate variables and fire outcomes, and develop data-driven insights about wildfire risk management.

1.2 Learning Objectives

By the end of this lab, you will be able to:

  1. Apply exploratory data analysis to investigate real-world environmental phenomena
  2. Create publication-quality visualizations using R and ggplot2
  3. Evaluate claims from news media using statistical evidence
  4. Understand spatial and temporal patterns in wildfire data
  5. Connect statistical analysis to policy-relevant questions about risk management

1.3 Required Packages

# Install packages if needed (uncomment if necessary)
# install.packages(c("tidyverse", "ggplot2", "dplyr", "readr", "scales", 
#                    "viridis", "ggrepel", "corrplot", "GGally", 
#                    "ggridges", "plotly", "RColorBrewer", "patchwork"))

# Load required libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(scales)
library(viridis)
library(ggrepel)
library(corrplot)
library(GGally)
library(ggridges)
library(plotly)
library(RColorBrewer)
library(patchwork)

# Set custom theme for all plots
theme_wildfire <- theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0, color = "#2C3E50"),
    plot.subtitle = element_text(color = "#7F8C8D", size = 11),
    plot.caption = element_text(color = "#95A5A6", size = 9, hjust = 1),
    axis.title = element_text(face = "bold", size = 11),
    axis.text = element_text(size = 10),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "#ECF0F1"),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold", size = 11)
  )

# Set as default theme
theme_set(theme_wildfire)

# Define color palettes
fire_colors <- c("#FF6B35", "#D32F2F", "#FFA726", "#8B0000", "#FFD700")
drought_colors <- c("Abnormally Dry" = "#FFEDA0", 
                    "Moderate" = "#FEB24C", 
                    "Severe" = "#F03B20",
                    "Extreme" = "#BD0026",
                    "Exceptional" = "#800026")
region_colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00",
                   "#FFFF33", "#A65628", "#F781BF", "#999999", "#66C2A5")

Before You Begin: ggplot2 Fundamentals

This section prepares you for the visualizations in this lab. If you’re new to R or ggplot2, work through this warm-up before starting Part 1.

Helpful Resources

Keep these references handy as you work through the lab:

Resource Description Link
ggplot2 Cheatsheet Quick reference for all ggplot2 functions https://ggplot2.tidyverse.org
R Graph Gallery Visual examples with code https://r-graph-gallery.com
R Programming 101 Video tutorials for beginners https://youtube.com/playlist?list=PLtL57Fdbwb_C6RS0JtBojTNOMVlgpeJkS

Understanding ggplot2 Syntax

ggplot2 builds plots by adding layers using the + operator. Think of it like stacking transparent sheets:

ggplot(data, aes(x, y)) +     # Layer 1: Canvas + coordinate mapping
  geom_point() +               # Layer 2: Add points
  labs(title = "My Plot") +    # Layer 3: Add labels
  theme_minimal()              # Layer 4: Apply styling

Key components:

Component Purpose Example
ggplot() Initialize the plot with data ggplot(data = my_data)
aes() Map variables to visual properties aes(x = year, y = acres)
geom_*() Specify the type of plot geom_bar(), geom_line(), geom_point()
labs() Add titles and labels labs(title = "Fire Trends")
theme_*() Control appearance theme_minimal()

Common Errors & How to Fix Them

Before you start coding, familiarize yourself with these frequent mistakes:

Error What You Wrote What You Should Write
Missing + between layers ggplot(df, aes(x, y)) geom_point() ggplot(df, aes(x, y)) + geom_point()
Forgetting closing ) aes(x = year, y = acres aes(x = year, y = acres)
Missing comma in aes() aes(x = year y = acres) aes(x = year, y = acres)
Variable name typo aes(x = Year) when column is year aes(x = year) — R is case-sensitive!
Using = instead of == in filter filter(region = "North") filter(region == "North")
Quotes around variable names aes(x = "year") aes(x = year) — no quotes for variables
Plus sign at line start ggplot(df, aes(x, y))
+ geom_point()
ggplot(df, aes(x, y)) +
geom_point()

Tip: If you get an error, read it carefully! R often tells you exactly what’s wrong (e.g., “unexpected symbol” usually means a missing comma or parenthesis).

Warm-Up Exercises

Practice these exercises using the built-in mtcars dataset before tackling the wildfire data.

Exercise A: Your First Plot

Run this code to create a simple scatter plot:

# This is provided for you - just run it!
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

What to notice: The aes() maps wt (weight) to the x-axis and mpg (miles per gallon) to the y-axis.

Exercise B: Fill in the Blanks

Complete this code to create a bar chart showing the count of cars by number of cylinders (cyl):

# Fill in the blank: what variable goes on the x-axis?
ggplot(data = mtcars, aes(x = _____)) +
  geom_bar()
Click for Solution
ggplot(data = mtcars, aes(x = cyl)) +
  geom_bar()

Exercise C: Add Color

Modify this scatter plot to color points by the number of cylinders (cyl):

# Add color inside aes() to map cyl to point color
ggplot(data = mtcars, aes(x = wt, y = mpg, _____ = _____)) +
  geom_point(size = 3)
Click for Solution
ggplot(data = mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3)

Exercise D: Filter Data

Use filter() to show only 6-cylinder cars, then plot:

# Filter for 6-cylinder cars, then create a scatter plot
mtcars %>%
  filter(cyl _____ _____) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()
Click for Solution
mtcars %>%
  filter(cyl == 6) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()

Exercise E: Complete the Pipeline

Put it all together! Fill in the blanks to filter, plot, and label:

mtcars %>%
  filter(hp > _____) %>%                           # Cars with more than 150 horsepower
  ggplot(aes(x = _____, y = _____, color = cyl)) + # Plot weight vs mpg
  geom_point(size = 3) +
  labs(
    title = "_____",                               # Add a descriptive title
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  )
Click for Solution
mtcars %>%
  filter(hp > 150) %>%
  ggplot(aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3) +
  labs(
    title = "Fuel Efficiency of High-Horsepower Cars",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  )

Quick Reference: Common geom Types

You’ll encounter these plot types throughout the lab:

geom Use For Example
geom_bar() Counting categories Number of fires by drought level
geom_col() Bars with pre-calculated values Acres burned by year
geom_line() Trends over time Temperature changes 2001-2024
geom_point() Scatter plots Acres vs. temperature
geom_boxplot() Distribution comparison Acres burned by region
geom_histogram() Single variable distribution Distribution of fire sizes
geom_smooth() Trend lines Adding regression to scatter plots

Ready? Once you’ve completed the warm-up exercises, proceed to Part 1!


2 Introduction & Data Exploration (Questions 1-8)

2.1 Question 1: Context Setting

The TIME Magazine article states: “The key question for living with wildfire is how we as humans manage the risks.” Before exploring the data, what do you think are the main factors that contribute to wildfire risk?

SOLUTION:

  • Climate factors: Temperature, drought conditions, humidity, wind patterns
  • Vegetation/fuel: Dry brush, forest density, dead trees, invasive species
  • Human factors: Development in fire-prone areas (WUI), human-caused ignitions, power line infrastructure
  • Geographic factors: Topography, proximity to wildlands, regional climate patterns
  • Infrastructure: Firefighting resources, early warning systems, evacuation routes
  • Policy factors: Land management practices, building codes, insurance availability

2.2 Question 2: Import Statewide Data

Import the statewide wildfire dataset (ca_wildfires_statewide_2001_2024.csv). Use dim() and str() to explore its structure. How many years of data are included?

# Import the statewide time series data
statewide <- read.csv("ca_wildfires_statewide_2001_2024.csv", stringsAsFactors = FALSE)

# Check dimensions
dim(statewide)
## [1] 24  7
# Examine structure
str(statewide)
## 'data.frame':    24 obs. of  7 variables:
##  $ year            : int  2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 ...
##  $ fire_count      : int  217 254 364 286 322 332 357 474 265 210 ...
##  $ acres_burned    : num  5413 10228 20083 5832 5002 ...
##  $ avg_temp_f      : num  57.9 57.6 58.1 57.8 57.5 ...
##  $ tree_loss_acres : int  52189 68565 61289 66683 47175 86057 136817 214406 114220 35661 ...
##  $ carbon_emissions: int  35641331 39092238 33221508 40625035 38657789 48874782 63318420 111709151 58619291 33145575 ...
##  $ drought_level   : chr  "Abnormally Dry" "Moderate" "Abnormally Dry" "Moderate" ...
# Preview data
head(statewide)

SOLUTION:

The dataset contains 24 rows (years from 2001-2024) and 7 columns (variables). This represents 24 years of California Wildfire data.


2.3 Question 3: Data Dictionary - Variable Definitions

Using the data dictionary, match each variable to its definition. What does tree_loss_acres measure and what units is it in?

SOLUTION:

Variable Definition Units
year Calendar year Numeric (2001-2024)
fire_count Number of fires statewide Count
acres_burned Total acres burned statewide Acres
avg_temp_f State average temperature Degrees Fahrenheit
tree_loss_acres Tree cover loss from all causes Acres
carbon_emissions CO₂ released from all tree cover loss Megagrams (Mg)
drought_level Drought severity category Categorical

tree_loss_acres measures the total tree cover loss in acres (1 hectare = 2.47 acres). This represents the forest area lost from all causes including wildfires, logging, agriculture, natural disturbances, and development.

Important Note: The tree_loss_acres and carbon_emissions variables represent all tree cover loss in California, not just wildfire-related loss. According to Global Forest Watch methodology, tree cover loss includes stand-replacing wildfires, agricultural conversion, logging operations, natural disturbances (storms, flooding, insect damage), and urban development.


2.4 Question 4: Summary Statistics for Acres Burned

Calculate summary statistics for acres_burned using summary(). What is the median acres burned per year?

# Calculate summary statistics
summary(statewide$acres_burned)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2969    6712   12902   17336   21647   73235
# Additional statistics
cat("\n--- Additional Statistics ---\n")
## 
## --- Additional Statistics ---
cat("Mean:", round(mean(statewide$acres_burned), 2), "acres\n")
## Mean: 17336.42 acres
cat("Median:", round(median(statewide$acres_burned), 2), "acres\n")
## Median: 12901.65 acres
cat("Standard Deviation:", round(sd(statewide$acres_burned), 2), "acres\n")
## Standard Deviation: 15874.62 acres
cat("Range:", round(max(statewide$acres_burned) - min(statewide$acres_burned), 2), "acres\n")
## Range: 70266.2 acres

SOLUTION:

The median acres burned per year is approximately 12,901.65 acres. Note that the mean (17,336.42 acres) is considerably higher than the median, indicating the distribution is right-skewed—a few extreme fire years pull the average upward.

2.5 Question 5: Drought Level Distribution

The drought_level variable is categorical. Use table() to count how many years fall into each drought category. Which drought level is most common?

# Count years by drought level
drought_table <- table(statewide$drought_level)
print(drought_table)
## 
## Abnormally Dry    Exceptional        Extreme       Moderate         Severe 
##             10              2              3              6              3
# Calculate proportions
prop.table(drought_table) * 100
## 
## Abnormally Dry    Exceptional        Extreme       Moderate         Severe 
##      41.666667       8.333333      12.500000      25.000000      12.500000
# Create a modern horizontal bar chart
drought_summary <- statewide %>%
  count(drought_level) %>%
  mutate(drought_level = factor(drought_level, 
                                levels = c("Abnormally Dry", "Moderate", "Severe", 
                                          "Extreme", "Exceptional")))

ggplot(drought_summary, aes(x = reorder(drought_level, n), y = n, fill = drought_level)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  geom_text(aes(label = n), hjust = -0.3, fontface = "bold", size = 4) +
  coord_flip() +
  scale_fill_manual(values = c("Abnormally Dry" = "#FFEDA0", 
                               "Moderate" = "#FEB24C", 
                               "Severe" = "#FC4E2A",
                               "Extreme" = "#E31A1C",
                               "Exceptional" = "#800026")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Distribution of Drought Levels in California (2001-2024)",
    subtitle = "Number of years at each drought severity level",
    x = NULL,
    y = "Number of Years",
    caption = "Source: U.S. Drought Monitor Data 2001-2024"
  )

SOLUTION:

“Abnormally Dry” is the most common drought level, occurring in 10 out of 24 years (41.67%). The distribution shows:

  • Abnormally Dry: 10 years (41.67%)
  • Moderate: 6 years (25.00%)
  • Severe: 3 years (12.50%)
  • Extreme: 3 years (12.50%)
  • Exceptional: 2 years (8.33%)

2.6 Question 6: Import County-Level Data

Now import the county-level dataset (ca_wildfires_county_2019_2023.csv). How many counties and regions are in California based on this data?

# Import county-level data
county_data <- read.csv("ca_wildfires_county_2019_2023.csv", stringsAsFactors = FALSE)

# Check dimensions
dim(county_data)
## [1] 58 17
# Number of counties
n_counties <- nrow(county_data)
cat("Number of California counties:", n_counties, "\n")
## Number of California counties: 58
n_regions <- length(unique(county_data$region))
cat("Number of California regions:", n_regions, "\n")
## Number of California regions: 10
# Preview
head(county_data)

SOLUTION:

California has 58 counties and 10 regions represented in this dataset. This matches the actual number of counties in California.

2.6.1 California Regions Reference Map

The county dataset includes a region variable that groups California’s 58 counties into 10 geographic regions. This map shows how counties are organized:
California Census 2020 Regions. Source: California Department of Finance

California Census 2020 Regions. Source: California Department of Finance


2.7 Question 7: Understanding the Wildland-Urban Interface (WUI)

Examine the data dictionary for the county dataset. What is the Wildland-Urban Interface (WUI)? Why is this concept important for understanding wildfire risk?

SOLUTION:

The Wildland-Urban Interface (WUI) is defined as the zone where human development meets or intermingles with wildland vegetation.

California Wildland–Urban Interface (WUI), 2020. Source: SILVIS Lab (University of Wisconsin–Madison) and U.S. Forest Service. Data derived from 2020 U.S. Census block geography and the 2019 National Land Cover Database (NLCD).

California Wildland–Urban Interface (WUI), 2020. Source: SILVIS Lab (University of Wisconsin–Madison) and U.S. Forest Service. Data derived from 2020 U.S. Census block geography and the 2019 National Land Cover Database (NLCD).

According to the data dictionary:

  • acres_wui: Total land area classified as WUI, measured in acres
  • prop_wui: Proportion of total county land area classified as WUI

Why WUI is important for wildfire risk:

  1. Human exposure: People living in WUI areas are directly exposed to wildfire danger
  2. Property vulnerability: Homes and structures in WUI zones face higher destruction risk
  3. Fire ignition: Human activity in WUI increases both accidental and intentional fire starts
  4. Firefighting challenges: WUI fires require protecting structures while fighting wildland fire
  5. Policy implications: Understanding WUI helps target building codes, insurance policies, and evacuation planning

The TIME article specifically mentions that “wildfires will continue to cause devastation as long as areas that were previously natural vegetation are commercially developed” — this is precisely the WUI expansion issue.

Note on homes_damaged: This variable counts residential structures that sustained greater than 50% damage from wildfires, meaning homes that were destroyed or damaged beyond half their value. This threshold ensures we are measuring significant structural damage rather than minor fire effects.


2.8 Question 8: Why San Francisco Has Zero Fires

Notice that San Francisco shows 0 acres burned and 0 homes damaged. Based on the note that “Only fires grown to 10 acres or more are recorded,” why might a densely urban area like San Francisco have no wildfire records?

# Extract San Francisco data
sf_data <- county_data %>% filter(county == "San Francisco")
print(sf_data)
##          county             region population median_income gini_index
## 1 San Francisco Los Angeles County     836321        141446     0.5181
##   prop_65_older drought_level acres_burned prop_area_burned fires_human
## 1        0.1717      Moderate            0                0           0
##   fires_natural fires_other homes_damaged acres_wui prop_wui carbon_emissions
## 1             0           0             0    847.31   0.0284            13141
##   tree_loss_acres
## 1              21
# Compare with other urban areas
urban_comparison <- county_data %>%
  filter(county %in% c("San Francisco", "Los Angeles", "San Diego", "Santa Cruz")) %>%
  select(county, population, acres_burned, homes_damaged, prop_wui, acres_wui)
print(urban_comparison)
##          county population acres_burned homes_damaged prop_wui  acres_wui
## 1   Los Angeles    9848406     41584.73           363   0.3572  928268.57
## 2     San Diego    3282782      7138.23           104   0.4497 1211902.58
## 3 San Francisco     836321         0.00             0   0.0284     847.31
## 4    Santa Cruz     266021     21302.57          1431   0.7977  227245.03

SOLUTION:

San Francisco has zero wildfire records for several interconnected reasons:

  1. Extreme urban density: San Francisco (47 sq mi) is one of the most densely populated cities in the US with ~836,000 residents. There is virtually no wildland vegetation to sustain a 10+ acre fire.

  2. Minimal WUI: San Francisco has only 847 acres of WUI (the smallest in the state) and only 2.84% of its land is classified as WUI. Compare this to Santa Cruz County at 79.77% WUI.

  3. 10-acre reporting threshold: The data only includes fires that grew to 10 acres or more. Any small fires in San Francisco would be quickly contained by the dense urban infrastructure and fire department before reaching this threshold.

  4. Geographic constraints: The city is surrounded by water on three sides (Pacific Ocean, San Francisco Bay), limiting fire spread potential.

  5. Fire suppression infrastructure: Dense urban areas have extensive fire hydrant networks, fire stations, and rapid response capabilities.

Important statistical note: Zero does NOT mean “no fire risk” — it means no fires reached the recording threshold. This is an example of how data collection methods affect what we observe.


3 Historical Wildfire Patterns (Questions 9-19)

3.1 Question 9: Fire Count Time Series

Create a time series line plot showing fire_count from 2001-2024. Add proper axis labels and a descriptive title. Based on the plot, describe the overall trend in the number of fires.

Instructions: Fill in the blanks (_____) to complete the visualization code. Each step is labeled to guide you.

# STEP 1: Create the base plot
# Hint: The statewide dataset contains yearly data. Map year to x-axis, fire_count to y-axis.
ggplot(data = _____, aes(x = _____, y = _____)) +
  
  # STEP 2: Add line geometry (color and linewidth provided)
  geom_line(color = "#D32F2F", linewidth = 1.2) +
  
  # STEP 3: Add point geometry (provided)
  geom_point(color = "#D32F2F", size = 3) +
  
  # STEP 4: Add smoothed trend line
  # Hint: Use method = "loess" for local polynomial regression
  geom_smooth(method = "_____", se = TRUE, alpha = 0.2, color = "#1976D2", linetype = "dashed") +
  
  # STEP 5: Set x-axis breaks (provided)
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  
  # STEP 6: Format y-axis with commas (provided)
  scale_y_continuous(labels = comma) +
  
  # STEP 7: Add labels
  # Hint: Write a title that describes what the plot shows
  labs(
    title = "_____",
    subtitle = "Number of fires reaching 10+ acres per year with trend line",
    x = "Year",
    y = "Number of Fires",
    caption = "Source: CALFIRE Data | Dashed line shows LOESS smoothed trend"
  )
Click for Solution
ggplot(data = statewide, aes(x = year, y = fire_count)) +
  geom_line(color = "#D32F2F", linewidth = 1.2) +
  geom_point(color = "#D32F2F", size = 3) +
  geom_smooth(method = "loess", se = TRUE, alpha = 0.2, color = "#1976D2", linetype = "dashed") +
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Annual Wildfire Count in California (2001-2024)",
    subtitle = "Number of fires reaching 10+ acres per year with trend line",
    x = "Year",
    y = "Number of Fires",
    caption = "Source: CALFIRE Data | Dashed line shows LOESS smoothed trend"
  )

SOLUTION:

The fire count shows high variability with no clear linear trend, but some notable patterns:

  • Minimum: 210 fires in 2010
  • Maximum: 640 fires in 2017
  • Recent peak: 561 fires in 2024

The trend appears cyclical rather than consistently increasing. Notable spikes occurred in 2008, 2017, and 2020. The most recent years (2022-2024) show continued elevated fire counts compared to earlier decades.


3.2 Question 10: Acres Burned Time Series

Now create a similar time series for acres_burned. Are the patterns similar to fire count? What year had the most acres burned?

Instructions: Fill in the blanks to complete the code.

# STEP 1: Find the peak year for annotation
# Hint: Filter where acres_burned equals the maximum value
peak_year <- statewide %>% filter(acres_burned == _____(acres_burned))

# STEP 2: Create the base plot
# Hint: Same dataset, but now y = acres_burned
ggplot(data = _____, aes(x = _____, y = _____)) +
  
  # STEP 3: Add line geometry (provided)
  geom_line(color = "#FF6B35", linewidth = 1.2) +
  
  # STEP 4: Add points (provided)
  geom_point(color = "#FF6B35", size = 3) +
  
  # STEP 5: Highlight the peak year with a different point shape (provided)
  geom_point(data = peak_year, aes(x = year, y = acres_burned), 
             color = "#8B0000", size = 5, shape = 18) +
  
  # STEP 6: Add text label for peak year (provided)
  geom_text(data = peak_year, aes(label = paste0(year, "\n", comma(round(acres_burned)), " acres")),
            vjust = -0.8, hjust = 0.5, fontface = "bold", size = 3.5, color = "#8B0000") +
  
  # STEP 7: Format axes (provided)
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  scale_y_continuous(labels = comma) +
  
  # STEP 8: Add labels (provided)
  labs(
    title = "Total Acres Burned by Wildfires in California (2001-2024)",
    subtitle = "Peak year highlighted | Fire severity shows increasing extremes",
    x = "Year",
    y = "Acres Burned",
    caption = "Source: CALFIRE Data"
  )
Click for Solution
peak_year <- statewide %>% filter(acres_burned == max(acres_burned))

ggplot(data = statewide, aes(x = year, y = acres_burned)) +
  geom_line(color = "#FF6B35", linewidth = 1.2) +
  geom_point(color = "#FF6B35", size = 3) +
  geom_point(data = peak_year, aes(x = year, y = acres_burned), 
             color = "#8B0000", size = 5, shape = 18) +
  geom_text(data = peak_year, aes(label = paste0(year, "\n", comma(round(acres_burned)), " acres")),
            vjust = -0.8, hjust = 0.5, fontface = "bold", size = 3.5, color = "#8B0000") +
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Total Acres Burned by Wildfires in California (2001-2024)",
    subtitle = "Peak year highlighted | Fire severity shows increasing extremes",
    x = "Year",
    y = "Acres Burned",
    caption = "Source: CALFIRE Data"
  )

SOLUTION:

2020 had the most acres burned with approximately 73,235 acres.

The patterns differ significantly from fire count:

  • Fire count and acres burned are not perfectly correlated — fewer fires can burn more area if conditions are extreme
  • The 2020 peak represents a major outlier (nearly 3x the median)
  • Years 2017, 2018, 2020, and 2021 show notably larger burn areas
  • The pattern suggests fires are becoming larger on average, even when counts don’t dramatically increase

3.3 Question 11: Testing the “Larger Fires” Claim

The TIME article claims wildfires are becoming “larger.” Calculate the mean acres burned for the first half (2001-2012) vs. second half (2013-2024) of the dataset. Does this support the claim?

# Split data into halves
first_half <- statewide %>% filter(year <= 2012)
second_half <- statewide %>% filter(year > 2012)

# Calculate statistics for each half
comparison <- data.frame(
  Period = c("2001-2012 (First Half)", "2013-2024 (Second Half)"),
  Mean_Acres = c(mean(first_half$acres_burned), mean(second_half$acres_burned)),
  Median_Acres = c(median(first_half$acres_burned), median(second_half$acres_burned)),
  Max_Acres = c(max(first_half$acres_burned), max(second_half$acres_burned)),
  Total_Acres = c(sum(first_half$acres_burned), sum(second_half$acres_burned))
)

print(comparison)
##                    Period Mean_Acres Median_Acres Max_Acres Total_Acres
## 1  2001-2012 (First Half)   12141.13     10297.43  26589.35    145693.5
## 2 2013-2024 (Second Half)   22531.71     14341.78  73235.29    270380.5
# Calculate percent increase
pct_increase_mean <- (comparison$Mean_Acres[2] - comparison$Mean_Acres[1]) / comparison$Mean_Acres[1] * 100
cat("\nPercent increase in mean acres burned:", round(pct_increase_mean, 1), "%\n")
## 
## Percent increase in mean acres burned: 85.6 %

SOLUTION:

The data supports the claim that wildfires are becoming larger:

  • Mean acres burned increased by approximately 85.6% from the first half (12,141 acres) to the second half (22,531 acres)
  • The median also increased substantially
  • The maximum fire severity in the second half (73,235 acres in 2020) far exceeds the first half maximum (26,589 acres in 2008)

This is consistent with the TIME article’s claim that climate change has “exacerbated the size” of wildfires.


3.4 Question 12: Temperature Change Between Periods

Using the same first-half (2001-2012) and second-half (2013-2024) split, calculate the difference in mean temperature between the two periods. Is this temperature change significant in the context of climate science?

# Calculate mean temperature for each period
temp_first_half <- mean(first_half$avg_temp_f, na.rm = TRUE)
temp_second_half <- mean(second_half$avg_temp_f, na.rm = TRUE)

# Calculate the difference
temp_difference <- temp_second_half - temp_first_half

cat("Mean Temperature (2001-2012):", round(temp_first_half, 2), "°F\n")
## Mean Temperature (2001-2012): 57.49 °F
cat("Mean Temperature (2013-2024):", round(temp_second_half, 2), "°F\n")
## Mean Temperature (2013-2024): 58.87 °F
cat("Temperature Increase:", round(temp_difference, 2), "°F\n")
## Temperature Increase: 1.38 °F

SOLUTION:

\[ \begin{array}{ccc} \text{1.38}^\circ \mathrm{F} \approx \text{0.77}^\circ \mathrm{C}\text{ Increase} & \Longrightarrow & \text{0.64}^\circ \mathrm{C}\,/\,\text{decade} & \Longrightarrow & \text{111% Burn Area Increase per }^\circ \mathrm{C} \\[6pt] \end{array} \]

The mean temperature increased by approximately 1.38°F (~0.77°C) between the two 12-year periods.

Is this significant? Yes, this is a substantial change in climate science terms:

  1. Rate of change (°C): An increase of 0.77 °C over a 12-year period corresponds to a warming rate of approximately 0.64 °C per decade.

  2. Global Context: This is 3.5 times faster than the global average warming rate of approximately 0.18 °C (0.32 °F) per decade observed since 1981.

  3. Amplification Effect: According to Abatzoglou & Williams (2016) in Proceedings of the National Academy of Sciences (PNAS), each 1 °C of warming can increase wildfire area burned by 100–600% in the western U.S. due to drier vegetation and increased vapor pressure deficit.

  4. Validating with our data: From Question 11, we found acres burned increased by ≈85.6% while temperature rose ≈0.77 °C. This yields an observed rate of ≈111% per °C (85.6% ÷ 0.77 °C), which falls precisely within the PNAS predicted range.

Key insight: While 1.38°F may seem small in everyday terms (you wouldn’t notice it on a thermometer), in climate systems it represents a significant shift that cascades through ecosystems.


3.5 Question 13: Dual Variable Visualization

Create a dual-axis or faceted plot showing both fire count and acres burned over time. What relationship do you observe between the number of fires and total area burned?

Instructions: Fill in the blanks to reshape data and create faceted visualization.

# STEP 1: Reshape data from wide to long format for faceting
# Hint: Select year, fire_count, and acres_burned columns
statewide_long <- statewide %>%
  select(year, _____, _____) %>%
  
  # STEP 2: Pivot to long format
  # Hint: The cols argument specifies which columns to pivot
  pivot_longer(cols = c(_____, _____),
               names_to = "metric",
               values_to = "value") %>%
  
  # STEP 3: Create readable labels (provided)
  mutate(metric = case_when(
    metric == "fire_count" ~ "Number of Fires",
    metric == "acres_burned" ~ "Acres Burned"
  ))

# STEP 4: Create faceted plot
# Hint: Map year to x, value to y, color by metric
ggplot(statewide_long, aes(x = _____, y = _____, color = _____)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.5) +
  
  # STEP 5: Create facets - one panel per metric
  # Hint: Use scales = "free_y" so each panel has its own y-axis scale
  facet_wrap(~metric, ncol = 1, scales = "_____") +
  
  # STEP 6: Format axes and add labels (provided)
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = c("Number of Fires" = "#1976D2", "Acres Burned" = "#D32F2F")) +
  labs(
    title = "California Wildfire Trends: Count vs. Severity (2001-2024)",
    subtitle = "Fire count and acres burned show different patterns over time",
    x = "Year",
    y = NULL,
    caption = "Source: CALFIRE Data"
  ) +
  theme(legend.position = "none",
        strip.background = element_rect(fill = "#ECF0F1", color = NA))
Click for Solution
statewide_long <- statewide %>%
  select(year, fire_count, acres_burned) %>%
  pivot_longer(cols = c(fire_count, acres_burned),
               names_to = "metric",
               values_to = "value") %>%
  mutate(metric = case_when(
    metric == "fire_count" ~ "Number of Fires",
    metric == "acres_burned" ~ "Acres Burned"
  ))

ggplot(statewide_long, aes(x = year, y = value, color = metric)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.5) +
  facet_wrap(~metric, ncol = 1, scales = "free_y") +
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = c("Number of Fires" = "#1976D2", "Acres Burned" = "#D32F2F")) +
  labs(
    title = "California Wildfire Trends: Count vs. Severity (2001-2024)",
    subtitle = "Fire count and acres burned show different patterns over time",
    x = "Year",
    y = NULL,
    caption = "Source: CALFIRE Data"
  ) +
  theme(legend.position = "none",
        strip.background = element_rect(fill = "#ECF0F1", color = NA))

SOLUTION:

The relationship between fire count and acres burned is moderate but not perfectly correlated:

  • Some years have many fires but relatively few acres burned (e.g., 2017 had 640 fires but not the most acres)
  • Other years show devastating fire severity despite moderate counts (e.g., 2020, 2021)
  • This indicates that fire intensity matters as much as frequency — a few large fires can cause more damage than many small ones
  • The correlation coefficient between these variables is approximately 0.67

3.6 Question 14: Coefficient of Variation

The article mentions climate change has extended the fire season. While we can’t directly measure season length, we can examine year-to-year variability. Calculate the coefficient of variation (CV = sd/mean × 100) for acres burned. What does high variability suggest?

# Calculate coefficient of variation
cv_acres <- (sd(statewide$acres_burned) / mean(statewide$acres_burned)) * 100
cv_firecount <- (sd(statewide$fire_count) / mean(statewide$fire_count)) * 100

cat("Coefficient of Variation (CV) for Acres Burned:", round(cv_acres, 1), "%\n")
## Coefficient of Variation (CV) for Acres Burned: 91.6 %
cat("Coefficient of Variation (CV) for Fire Count:", round(cv_firecount, 1), "%\n")
## Coefficient of Variation (CV) for Fire Count: 30.5 %

SOLUTION:

The coefficient of variation (CV) tells us how spread out the data is relative to its average value. When the spread is small compared to the average, outcomes are more predictable. When the spread is large compared to the average, outcomes are more unpredictable.

The coefficient of variation for acres burned is approximately 91.6%, which is very high. This means the standard deviation is nearly as large as the mean.

Interpretation:

  • High CV indicates extreme year-to-year unpredictability in fire severity
  • This is consistent with “climate whiplash” — rapid swings between wet and dry conditions
  • Policy implication: Fire management resources must be prepared for both average and extreme years
  • The CV for fire count (30.5%) is much lower, confirming that severity is more variable than frequency

3.7 Question 15: Top 10 Most Destructive Fire Years (Lollipop Chart)

Create a lollipop chart showing the top 10 most destructive fire years by acres burned. Which 3 years had the most devastating fires?

Instructions: Fill in the blanks to filter data and create the lollipop chart.

# STEP 1: Get top 10 years by acres burned
# Hint: Use arrange() with desc() to sort descending, then head() to get top 10
top10_years <- statewide %>%
  arrange(_____(acres_burned)) %>%
  _____(10) %>%
  mutate(year = factor(year))

# STEP 2: Create the lollipop chart
# Hint: x = reordered year by acres_burned, y = acres_burned
ggplot(top10_years, aes(x = reorder(_____, _____), y = _____)) +
  
  # STEP 3: Add the "stick" part of the lollipop
  # Hint: geom_segment draws lines from y=0 to y=acres_burned
  geom_segment(aes(xend = year, y = 0, yend = acres_burned), 
               color = "#D32F2F", linewidth = 1.2) +
  
  # STEP 4: Add the "candy" part (point at the top)
  geom_point(color = "#8B0000", size = 5) +
  
  # STEP 5: Add value labels (provided)
  geom_text(aes(label = comma(round(acres_burned))), 
            hjust = -0.2, size = 3.5, fontface = "bold") +
  
  # STEP 6: Flip coordinates for horizontal display
  coord_flip() +
  
  # STEP 7: Format and label (provided)
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 10 Most Destructive Wildfire Years in California",
    subtitle = "Ranked by total acres burned | 2020 stands out as an extreme outlier",
    x = "Year",
    y = "Acres Burned",
    caption = "Source: CALFIRE Data 2001-2024"
  )
Click for Solution
top10_years <- statewide %>%
  arrange(desc(acres_burned)) %>%
  head(10) %>%
  mutate(year = factor(year))

ggplot(top10_years, aes(x = reorder(year, acres_burned), y = acres_burned)) +
  geom_segment(aes(xend = year, y = 0, yend = acres_burned), 
               color = "#D32F2F", linewidth = 1.2) +
  geom_point(color = "#8B0000", size = 5) +
  geom_text(aes(label = comma(round(acres_burned))), 
            hjust = -0.2, size = 3.5, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 10 Most Destructive Wildfire Years in California",
    subtitle = "Ranked by total acres burned | 2020 stands out as an extreme outlier",
    x = "Year",
    y = "Acres Burned",
    caption = "Source: CALFIRE Data 2001-2024"
  )

SOLUTION:

The three most devastating fire years were:

  1. 2020: 73,235 acres burned
  2. 2021: 48,956 acres burned
  3. 2018: 28,980 acres burned

Notably, 6 of the top 10 worst fire years occurred in the second half of the dataset (2013-2024), further supporting the claim that fires are becoming more severe.


3.8 Question 16: Top 10 Counties by Acres Burned

Using the county dataset, calculate the total acres burned across all 58 counties. Create a horizontal bar chart showing the top 10 counties by total acres burned.

Instructions: Fill in the blanks to create the bar chart.

# STEP 1: Get top 10 counties by acres burned
# Hint: Similar to previous question - arrange descending, take top 10
top10_counties_acres <- county_data %>%
  arrange(_____(acres_burned)) %>%
  _____(10)

# STEP 2: Create horizontal bar chart
# Hint: x = reordered county, y = acres_burned, fill = acres_burned for color gradient
ggplot(top10_counties_acres, aes(x = reorder(county, acres_burned), y = acres_burned, fill = _____)) +
  
  # STEP 3: Add bars (provided)
  geom_col(width = 0.7) +
  
  # STEP 4: Add value labels (provided)
  geom_text(aes(label = comma(round(acres_burned))), 
            hjust = -0.1, size = 3.5, fontface = "bold") +
  
  # STEP 5: Flip to horizontal
  coord_flip() +
  
  # STEP 6: Use inferno color palette
  # Hint: viridis has options: "viridis", "magma", "plasma", "inferno", "cividis"
  scale_fill_viridis(option = "_____", direction = -1, guide = "none") +
  
  # STEP 7: Format and label (provided)
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 10 California Counties by Acres Burned (2019-2023)",
    subtitle = "Plumas County experienced the most fire damage by area",
    x = NULL,
    y = "Total Acres Burned",
    caption = "Source: CALFIRE Data"
  )
Click for Solution
top10_counties_acres <- county_data %>%
  arrange(desc(acres_burned)) %>%
  head(10)

ggplot(top10_counties_acres, aes(x = reorder(county, acres_burned), y = acres_burned, fill = acres_burned)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = comma(round(acres_burned))), 
            hjust = -0.1, size = 3.5, fontface = "bold") +
  coord_flip() +
  scale_fill_viridis(option = "inferno", direction = -1, guide = "none") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 10 California Counties by Acres Burned (2019-2023)",
    subtitle = "Plumas County experienced the most fire damage by area",
    x = NULL,
    y = "Total Acres Burned",
    caption = "Source: CALFIRE Data"
  )

SOLUTION:

The top 10 counties by acres burned are predominantly in Northern California and the Sierra Nevada region:

  1. Plumas: 251,679 acres
  2. Trinity: 133,060 acres
  3. Tehama: 122,500 acres
  4. Siskiyou: 118,326 acres
  5. Tulare: 73,062 acres

These are largely rural, forested counties with extensive wildland areas.


3.9 Question 17: Top 10 Counties by Homes Damaged

Create a similar chart for homes damaged. Are the counties with the most acres burned the same as those with the most homes damaged? Name any counties that appear in one top 10 but not the other.

Note: The homes_damaged variable counts residential structures that sustained greater than 50% damage from wildfires—meaning homes that were destroyed or significantly damaged beyond half their value.

# Top 10 counties by homes damaged
top10_counties_homes <- county_data %>%
  arrange(desc(homes_damaged)) %>%
  head(10)

ggplot(top10_counties_homes, aes(x = reorder(county, homes_damaged), y = homes_damaged, fill = homes_damaged)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = comma(homes_damaged)), 
            hjust = -0.1, size = 3.5, fontface = "bold") +
  coord_flip() +
  scale_fill_viridis(option = "magma", direction = -1, guide = "none") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Top 10 California Counties by Homes Damaged (2019-2023)",
    subtitle = "Butte County leads due to the devastating Camp Fire aftermath",
    x = NULL,
    y = "Homes Damaged or Destroyed (>50% damage)",
    caption = "Source: CALFIRE Data"
  )

# Compare the two lists
acres_top10 <- top10_counties_acres$county
homes_top10 <- top10_counties_homes$county

# Counties in acres top 10 but not homes top 10
only_acres <- setdiff(acres_top10, homes_top10)
cat("Counties in top 10 for ACRES but not HOMES:", paste(only_acres, collapse = ", "), "\n")
## Counties in top 10 for ACRES but not HOMES: Trinity, Tehama, Tulare, Lassen, Del Norte, Los Angeles
# Counties in homes top 10 but not acres top 10
only_homes <- setdiff(homes_top10, acres_top10)
cat("Counties in top 10 for HOMES but not ACRES:", paste(only_homes, collapse = ", "), "\n")
## Counties in top 10 for HOMES but not ACRES: Butte, Santa Cruz, Napa, Sonoma, Solano, Shasta

SOLUTION:

The lists are substantially different, highlighting a critical insight:

Counties in both top 10 lists: Plumas, Siskiyou, Butte, Shasta

In acres top 10 but NOT homes top 10: Trinity, Tehama, Tulare, Del Norte, Fresno, Lassen
In homes top 10 but NOT acres top 10: Santa Cruz, Napa, El Dorado, Sonoma, Solano, Los Angeles

Key insight: Large areas burned in remote, unpopulated counties may cause fewer home losses than smaller fires in densely populated WUI areas. Where fires occur matters as much as how large they are. Remember that homes damaged refers to structures with >50% damage, so these represent significant losses.


3.10 Question 18: WUI and Home Damage Correlation

The TIME article emphasizes that many people “living in the west are living in areas prone to wildfire.” Calculate the correlation between prop_wui (proportion living in WUI) and homes_damaged. Interpret this value.

# Calculate correlation
cor_wui_homes <- cor(county_data$prop_wui, county_data$homes_damaged, use = "complete.obs")
cat("Correlation between prop_wui and homes_damaged:", round(cor_wui_homes, 3), "\n")
## Correlation between prop_wui and homes_damaged: 0.332
# Test significance
cor_test <- cor.test(county_data$prop_wui, county_data$homes_damaged)
print(cor_test)
## 
##  Pearson's product-moment correlation
## 
## data:  county_data$prop_wui and county_data$homes_damaged
## t = 2.6323, df = 56, p-value = 0.01094
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08041755 0.54353170
## sample estimates:
##       cor 
## 0.3318209

SOLUTION:

The correlation coefficient is approximately 0.33, indicating a weak positive relationship between the proportion of a county in the WUI and homes damaged (>50% damage).

Interpretation:

  • The relationship is positive (more WUI → more home damage) but weak
  • This suggests WUI is one of many factors affecting home damage, not the sole determinant
  • Other factors like total population, fire suppression resources, and building materials also matter
  • The correlation is statistically significant (p < 0.05) but explains only about 5% of the variance

3.11 Question 19: WUI vs. Homes Damaged Scatter Plot

Create a scatter plot with prop_wui on the x-axis and homes_damaged on the y-axis. Add county labels for the 5 counties with the most home damage. What pattern do you observe?

# Identify top 5 counties for labeling
top5_homes <- county_data %>%
  arrange(desc(homes_damaged)) %>%
  head(5)

ggplot(county_data, aes(x = prop_wui, y = homes_damaged)) +
  geom_point(aes(size = population/1000000), alpha = 0.6, color = "#D32F2F") +
  geom_smooth(method = "lm", se = TRUE, color = "#1976D2", linetype = "dashed") +
  geom_label_repel(data = top5_homes,
                   aes(label = county),
                   size = 3.5,
                   fontface = "bold",
                   box.padding = 0.5,
                   point.padding = 0.3,
                   segment.color = "gray50",
                   max.overlaps = 20) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) +
  scale_size_continuous(name = "Population (millions)", range = c(2, 10)) +
  labs(
    title = "Wildland-Urban Interface and Home Damage by County",
    subtitle = "Counties with most home damage labeled | Point size indicates population",
    x = "Proportion of County in WUI",
    y = "Homes Damaged or Destroyed (>50% damage)",
    caption = "Source: CAL FIRE (2001–2024) and U.S. Census Bureau, ACS 5-Year Estimates (2023)"
  ) +
  theme(legend.position = "right")

SOLUTION:

Observed patterns:

  1. Weak positive trend: The regression line shows a slight upward slope, but with high scatter
  2. Outliers tell the story: Butte County has extreme home damage (2,834 homes with >50% damage) despite moderate WUI proportion (~43%)
  3. High WUI ≠ High damage: Santa Cruz has very high WUI (~80%) but moderate damage
  4. Population matters: Larger bubbles (more population) don’t necessarily mean more damage
  5. Regional variation: The most damaged counties are in Northern California’s Sierra foothills

Policy implication: WUI exposure is a risk factor, but other variables (fire behavior, evacuation success, building codes) strongly influence outcomes.


4 Climate Variables and Fire Relationships (Questions 20-31)

4.1 Question 20: Temperature vs. Acres Burned

The TIME article states: “Climate change has exacerbated the size, intensity, and damage caused by wildfires.” We’ll investigate the relationship between temperature and fire outcomes. First, create a scatter plot of avg_temp_f vs. acres_burned with a trend line using geom_smooth().

Instructions: Fill in the blanks to create the scatter plot.

# STEP 1: Convert drought_level to ordered factor for proper color mapping
# (This code is provided and should run first)
statewide$drought_level <- factor(
  statewide$drought_level,
  levels = c(
    "None",
    "Abnormally Dry",
    "Moderate",
    "Severe",
    "Extreme",
    "Exceptional"
  ),
  ordered = TRUE
)

# STEP 2: Create scatter plot
# Hint: x = avg_temp_f, y = acres_burned
ggplot(statewide, aes(x = _____, y = _____)) +
  
  # STEP 3: Add points colored by drought level
  # Hint: Put color = drought_level inside aes()
  geom_point(aes(color = _____), size = 4, alpha = 0.8) +
  
  # STEP 4: Add linear trend line
  # Hint: method = "lm" for linear model
  geom_smooth(method = "_____", se = TRUE, color = "#1976D2", linetype = "dashed") +
  
  # STEP 5: Apply custom colors for drought levels (provided)
  scale_color_manual(values = drought_colors, name = "Drought Level") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Relationship Between Temperature and Fire Severity",
    subtitle = "Higher temperatures tend to correlate with more acres burned",
    x = "Average Annual Temperature (F)",
    y = "Acres Burned",
    caption = "Sources: CAL FIRE (2001-2024), NOAA, and National Drought Mitigation Center"
  ) +
  theme(legend.position = "bottom")
Click for Solution
ggplot(statewide, aes(x = avg_temp_f, y = acres_burned)) +
  geom_point(aes(color = drought_level), size = 4, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE, color = "#1976D2", linetype = "dashed") +
  scale_color_manual(values = drought_colors, name = "Drought Level") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Relationship Between Temperature and Fire Severity",
    subtitle = "Higher temperatures tend to correlate with more acres burned",
    x = "Average Annual Temperature (F)",
    y = "Acres Burned",
    caption = "Sources: CAL FIRE (2001-2024), NOAA, and National Drought Mitigation Center"
  ) +
  theme(legend.position = "bottom")

SOLUTION:

The scatter plot reveals a positive relationship between temperature and acres burned. Warmer years tend to experience more severe fire seasons. The points are colored by drought level, showing that the most extreme fire years often coincide with drought conditions.


4.2 Question 21: Temperature-Acres Correlation

Calculate the Pearson correlation coefficient between avg_temp_f and acres_burned. Is the relationship positive or negative? Is it strong or weak?

# Calculate Pearson correlation
cor_temp_acres <- cor(statewide$avg_temp_f, statewide$acres_burned, use = "complete.obs")
cat("Pearson correlation (temp vs acres):", round(cor_temp_acres, 3), "\n")
## Pearson correlation (temp vs acres): 0.494
# Formal test
cor_test_temp <- cor.test(statewide$avg_temp_f, statewide$acres_burned)
print(cor_test_temp)
## 
##  Pearson's product-moment correlation
## 
## data:  statewide$avg_temp_f and statewide$acres_burned
## t = 2.662, df = 22, p-value = 0.01424
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1126125 0.7480403
## sample estimates:
##       cor 
## 0.4935868

Interpretation Guide:

  • |r| < 0.3: Weak

  • 0.3 ≤ |r| < 0.5: Moderate

  • 0.5 ≤ |r| < 0.7: Strong

  • |r| ≥ 0.7: Very Strong

SOLUTION:

The Pearson correlation coefficient is approximately 0.49, indicating a moderate positive relationship.

  • Direction: Positive (higher temperatures → more acres burned)
  • Strength: Moderate (explains about 20% of the variance in acres burned)
  • Significance: The p-value is approximately 0.014, indicating statistical significance at α = 0.05

This supports the claim that warming temperatures are associated with increased fire severity.


4.3 Question 22: Temperature vs. Tree Loss

Now examine the relationship between temperature and tree_loss_acres. Create a scatter plot with a loess smoother. Describe the relationship.

ggplot(statewide, aes(x = avg_temp_f, y = tree_loss_acres)) +
  geom_point(color = "#2E7D32", size = 4, alpha = 0.7) +
  geom_smooth(method = "loess", se = TRUE, color = "#8B4513", span = 0.75) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Temperature and Tree Cover Loss in California (2001-2024)",
    subtitle = "LOESS smoother reveals non-linear relationship with threshold around 58°F",
    x = "Average Annual Temperature (°F)",
    y = "Tree Cover Loss (acres)",
    caption = "Sources: NOAA and Global Forest Watch"
  )

SOLUTION:

The relationship between temperature and tree loss appears non-linear:

  • Below ~58°F, tree loss remains relatively low
  • Above ~58°F, tree loss increases dramatically
  • This suggests a threshold effect: once temperatures exceed a certain point, vegetation becomes significantly more vulnerable

The LOESS smoother captures this pattern better than a linear fit would.

Note: Tree cover loss includes all causes (wildfires, logging, agriculture, natural disturbances), not just fire-related loss.


4.4 Question 23: Correlation Matrix Heatmap

Create a correlation matrix heatmap for all numeric variables in the statewide dataset. Which two variables have the strongest positive correlation?

# Select numeric variables
numeric_vars <- statewide %>%
  select(year,fire_count, acres_burned, avg_temp_f, tree_loss_acres, carbon_emissions)

# Calculate correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")

# Better visualization with corrplot
corrplot(cor_matrix, 
         method = "color",
         type = "upper",
         order = "hclust",
         addCoef.col = "black",
         number.cex = 0.8,
         tl.col = "black",
         tl.srt = 45,
         col = colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))(100),
         title = "Correlation Matrix: California Wildfire Variables",
         mar = c(0, 0, 2, 0))

SOLUTION:

The strongest positive correlation is between tree_loss_acres and carbon_emissions with r ≈ 0.99.

Interpretation:

  • This makes physical sense: more tree loss = more carbon released from vegetation loss
  • Note that both variables measure all tree cover loss (not just wildfires), which is why their correlation is so high
  • Other strong correlations:
    • acres_burned & tree_loss_acres (r ≈ 0.81)
    • acres_burned & carbon_emissions (r ≈ 0.75)
  • year shows positive correlations with several variables, indicating that wildfire-related impacts have generally increased over time.

4.5 Question 24: Boxplots by Drought Level

The article discusses how drought conditions affect fire severity. Create side-by-side boxplots comparing acres_burned across different drought_level categories.

Instructions: Fill in the blanks to create the boxplot visualization.

# STEP 1: Ensure drought levels are ordered by severity
statewide <- statewide %>%
  mutate(drought_level = factor(drought_level, 
                                levels = c("Abnormally Dry", "Moderate", "Severe", 
                                          "Extreme", "Exceptional")))

# STEP 2: Create boxplot
# Hint: x = drought_level, y = acres_burned, fill = drought_level
ggplot(statewide, aes(x = _____, y = _____, fill = _____)) +
  
  # STEP 3: Add boxplot geometry
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 3) +
  
  # STEP 4: Add individual points with jitter
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  
  # STEP 5: Apply drought color palette (provided)
  scale_fill_manual(values = drought_colors, guide = "none") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Fire Severity by Drought Level in California",
    subtitle = "Boxplots show distribution of acres burned for each drought category",
    x = "Drought Level",
    y = "Acres Burned",
    caption = "Source: National Drought Mitigation Center Data 2001-2024"
  ) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))
Click for Solution
statewide <- statewide %>%
  mutate(drought_level = factor(drought_level, 
                                levels = c("Abnormally Dry", "Moderate", "Severe", 
                                          "Extreme", "Exceptional")))

ggplot(statewide, aes(x = drought_level, y = acres_burned, fill = drought_level)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, outlier.size = 3) +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  scale_fill_manual(values = drought_colors, guide = "none") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Fire Severity by Drought Level in California",
    subtitle = "Boxplots show distribution of acres burned for each drought category",
    x = "Drought Level",
    y = "Acres Burned",
    caption = "Source: National Drought Mitigation Center Data 2001-2024"
  ) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

SOLUTION:

The boxplots reveal no simple linear relationship between drought severity and acres burned:

  • “Moderate” drought years actually show the highest median acres burned
  • “Exceptional” drought years show high variability
  • This suggests drought is one of many factors — wind patterns, vegetation moisture timing, and ignition sources also matter

4.6 Question 25: Median Acres by Drought Level

Calculate the median acres burned for each drought level. Do more severe droughts correspond to more acres burned?

# Calculate median acres by drought level
drought_summary <- statewide %>%
  group_by(drought_level) %>%
  summarise(
    n_years = n(),
    median_acres = median(acres_burned),
    mean_acres = mean(acres_burned),
    max_acres = max(acres_burned)
  ) %>%
  arrange(factor(drought_level, levels = c("Abnormally Dry", "Moderate", "Severe", 
                                           "Extreme", "Exceptional")))

print(drought_summary)
## # A tibble: 5 × 5
##   drought_level  n_years median_acres mean_acres max_acres
##   <fct>            <int>        <dbl>      <dbl>     <dbl>
## 1 Abnormally Dry      10        6526.     11271.    25296.
## 2 Moderate             6       21820.     26986.    73235.
## 3 Severe               3       12902.     15219.    22388.
## 4 Extreme              3       11147.     22370.    48956.
## 5 Exceptional          2       14342.     14342.    15782.

SOLUTION:

The relationship is not straightforward:

Key insight: At the state level, years classified as having “moderate” drought show the highest typical (median) acres burned. However, this pattern can be misleading for two important reasons:

  1. State averages hide local extremes A year labeled as “moderate drought” for the entire state may still include counties experiencing severe or extreme drought, which are the places where large fires are most likely to occur. When we average conditions across the whole state, we lose this local detail.

  2. Annual data misses rapid climate shifts within a year Yearly drought categories cannot capture climate whiplash—for example, a wet winter that increases vegetation growth followed by a hot, dry summer that dries out that fuel and leads to fires. These short-term seasonal changes are important drivers of wildfire risk but are hidden when data are summarized by year.


4.7 Question 26: Violin Plot - County Drought and Burning

Using the county data, examine the relationship between drought_level and prop_area_burned. Create a violin plot with overlaid points. What patterns emerge?

# Order drought levels
county_data <- county_data %>%
  mutate(drought_level = factor(drought_level, 
                                levels = c("Abnormally Dry", "Moderate", "Severe")))

ggplot(county_data, aes(x = drought_level, y = prop_area_burned, fill = drought_level)) +
  geom_violin(alpha = 0.7, scale = "width") +
  geom_jitter(width = 0.15, alpha = 0.6, size = 2, color = "#2C3E50") +
  geom_boxplot(width = 0.15, fill = "white", alpha = 0.8, outlier.shape = NA) +
  scale_fill_manual(values = c("Abnormally Dry" = "#FFEDA0", 
                               "Moderate" = "#FEB24C", 
                               "Severe" = "#F03B20"),
                    guide = "none") +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Proportion of County Area Burned by Drought Level",
    subtitle = "Violin plots show distribution shape | Individual counties shown as points",
    x = "Drought Level",
    y = "Proportion of County Area Burned",
    caption = "Source: National Drought Mitigation Center 2019-2023"
  )

SOLUTION:

Patterns observed:

  1. All drought levels show right-skewed distributions with most counties experiencing low burn proportions
  2. Moderate drought includes the single most extreme outlier, while severe drought shows a wider overall distribution with more counties experiencing elevated burn proportions.
  3. Most counties have <5% of their area burned regardless of drought level
  4. A few outliers (like Plumas County) experience extreme burning

4.8 Question 27: Scatter Plot Matrix

Create a scatter plot matrix (using GGally::ggpairs() or faceting) showing relationships between avg_temp_f, tree_loss_acres, carbon_emissions, and acres_burned. Describe what you observe.

# Select variables for pairs plot
pairs_data <- statewide %>%
  select(avg_temp_f, tree_loss_acres, carbon_emissions, acres_burned) %>%
  rename(
    `Temp (°F)` = avg_temp_f,
    `Tree Loss (acres)` = tree_loss_acres,
    `Carbon (Mg)` = carbon_emissions,
    `Acres Burned` = acres_burned
  )

# Create pairs plot
ggpairs(pairs_data,
        upper = list(continuous = wrap("cor", size = 4, color = "#2C3E50")),
        lower = list(continuous = wrap("smooth", alpha = 0.5, color = "#D32F2F")),
        diag = list(continuous = wrap("densityDiag", fill = "#FFA726", alpha = 0.6))) +
  labs(title = "Relationships Between Key Wildfire Variables") +
  theme_wildfire

SOLUTION:

Key observations from the pairs plot:

  1. Strong correlations: Tree loss, carbon emissions, and acres burned are highly intercorrelated (r > 0.75)
  2. Temperature relationship: Temperature shows moderate positive correlation with all fire outcomes
  3. Distribution shapes: All variables except temperature show right-skewed distributions
  4. Multicollinearity note: The extremely high correlation (r ≈ 0.99) between tree loss and carbon emissions is expected because both measure the same underlying phenomenon—all tree cover loss, not just wildfires

4.9 Question 28: Carbon Emissions Area Chart

The climate data shows carbon emissions from tree cover loss. Create a stacked area chart showing cumulative carbon emissions over time. Has California’s tree cover loss-related carbon footprint been increasing?

Important Note: The carbon_emissions variable represents CO₂ released from all tree cover loss in California, not just wildfires. This includes emissions from wildfires, logging, agricultural conversion, natural disturbances, and development. While wildfires are a major driver, these figures represent total forest carbon impact.

# Calculate cumulative emissions
statewide <- statewide %>%
  arrange(year) %>%
  mutate(cumulative_carbon = cumsum(carbon_emissions))

ggplot(statewide, aes(x = year, y = carbon_emissions)) +
  geom_area(fill = "#D32F2F", alpha = 0.7) +
  geom_line(color = "#8B0000", linewidth = 1) +
  geom_point(color = "#8B0000", size = 2) +
  scale_x_continuous(breaks = seq(2001, 2024, by = 2)) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Annual Carbon Emissions from Tree Cover Loss in California",
    subtitle = "Emissions from all tree cover loss (fires, logging, development, etc.) | Recent peaks in 2020-2021",
    x = "Year",
    y = "Carbon Emissions (Million Mg)",
    caption = "Source: Global Forest Watch Data 2001-2024 | Includes all drivers of tree cover loss"
  )

SOLUTION:

The chart shows highly variable carbon emissions with notable spikes:

  • 2020 peak: ~160 million Mg
  • 2021 peak: ~233 million Mg (the highest)
  • Recent years (2020-2024) show consistently higher emissions than earlier decades

Yes, California’s tree cover loss-related carbon footprint has been increasing, particularly since 2017. Note that these emissions include all sources of tree cover loss, with wildfires being a major but not exclusive contributor.


4.10 Question 29: Carbon Emissions in Relatable Units

Calculate the total carbon emissions for the entire 24-year period. Express this in a relatable unit (hint: 1 megagram = 1 metric ton; average car emits ~4.6 tons/year). How many “car-years” of emissions is this?

# Total emissions
total_carbon_mg <- sum(statewide$carbon_emissions)
cat("Total carbon emissions (2001-2024):", format(total_carbon_mg, big.mark = ","), "Mg\n")
## Total carbon emissions (2001-2024): 1,695,100,582 Mg
# Convert to metric tons (1 Mg = 1 metric ton)
total_carbon_tons <- total_carbon_mg

# Average car emissions per year
car_emissions_per_year <- 4.6  # metric tons

# Calculate car-years equivalent
car_years <- total_carbon_tons / car_emissions_per_year
cat("Equivalent car-years of emissions:", format(round(car_years), big.mark = ","), "\n")
## Equivalent car-years of emissions: 368,500,127
# Additional context
us_cars <- 283000000  # approximate US registered vehicles
years_all_us_cars <- car_years / us_cars
cat("Equivalent to all US cars driving for:", round(years_all_us_cars, 2), "years\n")
## Equivalent to all US cars driving for: 1.3 years

SOLUTION:

  • Total carbon emissions (2001-2024): ~1.75 billion megagrams (metric tons)
  • Car-years equivalent: ~380 million car-years

In relatable terms: California’s tree cover loss over 24 years produced carbon emissions equivalent to all registered vehicles in the United States (~283 million cars) driving for about 1.3 years.

Important context: These emissions include all tree cover loss (wildfires, logging, agriculture, development, natural disturbances), not just wildfires. However, in California, wildfires are a major driver of tree cover loss, especially in recent years.


4.11 Question 30: Socioeconomic Bubble Plot

Using the county data, create a bubble plot with median_income on x-axis, homes_damaged on y-axis, and bubble size representing population. What socioeconomic patterns do you observe?

# Create bubble plot
ggplot(county_data, aes(x = median_income, y = homes_damaged, size = population)) +
  geom_point(alpha = 0.6, color = "#D32F2F") +
  geom_smooth(method = "lm", se = FALSE, color = "#1976D2", linetype = "dashed", 
              aes(weight = NULL), show.legend = FALSE) +
  geom_label_repel(data = county_data %>% arrange(desc(homes_damaged)) %>% head(5),
                   aes(label = county),
                   size = 3,
                   box.padding = 0.5,
                   max.overlaps = 15) +
  scale_x_continuous(labels = dollar_format()) +
  scale_y_continuous(labels = comma) +
  scale_size_continuous(name = "Population", 
                        range = c(2, 15),
                        labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Income, Population, and Wildfire Home Damage",
    subtitle = "Bubble size indicates county population | Top 5 damaged counties labeled",
    x = "Median Household Income",
    y = "Homes Damaged or Destroyed (>50% damage)",
    caption = "Source: CALFIRE Data 2019-2023 and U.S. Census Bureau, ACS 5-Year Estimates (2023)"
  ) +
  theme(legend.position = "right")

SOLUTION:

Socioeconomic patterns observed:

  1. No clear income-damage relationship: Home damage (>50% structural damage) occurs across all income levels
  2. Large population ≠ high damage: Los Angeles (largest bubble) has moderate damage despite huge population
  3. High-damage counties span income range: Butte (highest damage) has below-median income, while Napa and Santa Cruz (also high damage) have higher incomes

Key takeaway: These patterns suggest that wildfire risk is influenced more by geographic and environmental factors than by socioeconomic characteristics. Further analysis controlling for WUI exposure and regional location would be needed to confirm this relationship.


4.12 Question 31: Gini Index and Burning

The Gini index measures income inequality (higher = more unequal). Is there a relationship between gini_index and prop_area_burned? Create an appropriate visualization and calculate the correlation.

# Calculate correlation
cor_gini_burn <- cor(county_data$gini_index, county_data$prop_area_burned, use = "complete.obs")
cat("Correlation between Gini index and proportion burned:", round(cor_gini_burn, 3), "\n")
## Correlation between Gini index and proportion burned: 0.165
# Create scatter plot
ggplot(county_data, aes(x = gini_index, y = prop_area_burned)) +
  geom_point(aes(color = region), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#2C3E50", linetype = "dashed") +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  scale_color_brewer(palette = "Set3", name = "Region") +
  labs(
    title = "Income Inequality and Wildfire Impact by County",
    subtitle = paste0("Correlation: r = ", round(cor_gini_burn, 3), " | Weak positive relationship"),
    x = "Gini Index (Income Inequality)",
    y = "Proportion of County Area Burned",
    caption = "Source: CALFIRE Data 2019-2023"
  ) +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 8)) +
  guides(color = guide_legend(nrow = 2))

SOLUTION:

The correlation between Gini index and proportion burned is approximately 0.16 — a very weak positive relationship.

Interpretation:

  • Income inequality does not meaningfully predict fire exposure at the county level
  • Fire risk is determined primarily by geography, vegetation, and climate — not socioeconomic factors
  • However, the impact of fires on communities may differ by income level (ability to evacuate, rebuild, obtain insurance)

5 Fire Causes and Human Factors (Questions 32-39)

5.1 Question 32: Fire Causes Summary

The county dataset separates fires by cause: fires_human, fires_natural, and fires_other. Calculate the total fires in each category statewide. What percentage of fires are human-caused?

# Calculate totals
fire_causes <- county_data %>%
  summarise(
    Human = sum(fires_human, na.rm = TRUE),
    Natural = sum(fires_natural, na.rm = TRUE),
    Other = sum(fires_other, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = everything(), names_to = "Cause", values_to = "Count")

# Add percentages
total_fires <- sum(fire_causes$Count)
fire_causes <- fire_causes %>%
  mutate(Percentage = round(Count / total_fires * 100, 1))

print(fire_causes)
## # A tibble: 3 × 3
##   Cause   Count Percentage
##   <chr>   <dbl>      <dbl>
## 1 Human     298       12.1
## 2 Natural   317       12.9
## 3 Other    1848       75
cat("\n--- Summary ---\n")
## 
## --- Summary ---
cat("Total fires:", total_fires, "\n")
## Total fires: 2463
cat("Human-caused percentage:", fire_causes$Percentage[fire_causes$Cause == "Human"], "%\n")
## Human-caused percentage: 12.1 %

SOLUTION:

  • Human-caused fires: 270 (13.9%)
  • Natural (lightning) fires: 322 (16.6%)
  • Other/Unknown causes: 1,349 (69.5%)

Only about 14% of fires are confirmed human-caused, while the majority fall into “other/unknown” category. This highlights the challenges of fire cause attribution.


5.2 Question 33: Fire Causes by Region (Stacked Bar)

Create a stacked bar chart showing the composition of fire causes for each of the 10 regions. Which region has the highest proportion of human-caused fires?

# Summarize by region
region_causes <- county_data %>%
  group_by(region) %>%
  summarise(
    Human = sum(fires_human, na.rm = TRUE),
    Natural = sum(fires_natural, na.rm = TRUE),
    Other = sum(fires_other, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(Human, Natural, Other), names_to = "Cause", values_to = "Count") %>%
  group_by(region) %>%
  mutate(Proportion = Count / sum(Count))

# Calculate human proportion by region
human_prop <- region_causes %>%
  filter(Cause == "Human") %>%
  arrange(desc(Proportion))

cat("Regions ranked by proportion of human-caused fires:\n")
## Regions ranked by proportion of human-caused fires:
print(human_prop %>% select(region, Proportion) %>% mutate(Proportion = percent(Proportion, accuracy = 0.1)))
## # A tibble: 10 × 2
## # Groups:   region [10]
##    region                      Proportion
##    <chr>                       <chr>     
##  1 Northern San Joaquin Valley 20.2%     
##  2 San Francisco Bay Area      18.4%     
##  3 North Coast                 17.8%     
##  4 San Diego-Imperial          16.2%     
##  5 Central Coast               14.8%     
##  6 Inland Empire               13.4%     
##  7 Superior California         13.3%     
##  8 Los Angeles County          6.5%      
##  9 Southern San Joaquin Valley 4.6%      
## 10 Orange County               3.6%
# Create stacked bar chart
ggplot(region_causes, aes(x = reorder(region, -Count), y = Count, fill = Cause)) +
  geom_col(position = "fill", width = 0.8) +
  coord_flip() +
  scale_fill_manual(values = c("Human" = "#E74C3C", "Natural" = "#3498DB", "Other" = "#95A5A6"),
                    name = "Fire Cause") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Fire Cause Composition by California Region",
    subtitle = "Northern San Joaquin Valley has highest proportion of human-caused fires",
    x = NULL,
    y = "Proportion of Fires",
    caption = "Source: CALFIRE Data 2019-2023"
  ) +
  theme(legend.position = "bottom")

SOLUTION:

Northern San Joaquin Valley has the highest proportion of human-caused fires (approximately 20%), followed by the San Francisco Bay Area region (about 18.4%).

Key insights:

  • Urban and suburban regions show higher human-caused fire proportions
  • Remote regions (Superior California, Northern San Joaquin Valley) have higher proportions of natural (lightning) fires
  • The “Other/Unknown” category dominates most regions, indicating investigation challenges

5.3 Question 34: Cleveland Dot Plot - Human vs. Natural

Create a Cleveland dot plot comparing human-caused vs. natural fires for the top 15 counties by total fires. Which counties have notably more natural (lightning) fires?

# Get top 15 counties by total fires
top15_fires <- county_data %>%
  mutate(total_fires = fires_human + fires_natural + fires_other) %>%
  arrange(desc(total_fires)) %>%
  head(15) %>%
  select(county, fires_human, fires_natural, total_fires) %>%
  pivot_longer(cols = c(fires_human, fires_natural), 
               names_to = "Cause", 
               values_to = "Count") %>%
  mutate(Cause = ifelse(Cause == "fires_human", "Human-Caused", "Natural (Lightning)"))

ggplot(top15_fires, aes(x = Count, y = reorder(county, total_fires), color = Cause)) +
  geom_line(aes(group = county), color = "gray70", linewidth = 1) +
  geom_point(size = 4) +
  scale_color_manual(values = c("Human-Caused" = "#E74C3C", "Natural (Lightning)" = "#3498DB"),
                     name = "Fire Cause") +
  labs(
    title = "Human-Caused vs. Natural Fires by County",
    subtitle = "Top 15 counties by total fire count | Connected dots show gap between causes",
    x = "Number of Fires",
    y = NULL,
    caption = "Source: CALFIRE Data 2019-2023"
  ) +
  theme(legend.position = "bottom")

SOLUTION:

Counties with notably more natural (lightning) fires:

  1. Siskiyou: 41 natural vs. 16 human (lightning-dominated)
  2. Lassen: 39 natural vs. 2 human (extreme lightning dominance)

These are all remote, mountainous, forested counties in Northern California where lightning storms are common and human activity is limited.


5.4 Question 35: WUI Area and Home Damage Correlation

The article quotes an expert: “Wildfires will continue to cause devastation as long as areas that were previously natural vegetation are commercially developed.” Calculate the correlation between acres_wui and homes_damaged.

# Calculate correlation
cor_wui_acres_homes <- cor(county_data$acres_wui, county_data$homes_damaged, use = "complete.obs")
cat("Correlation between WUI acres and homes damaged:", round(cor_wui_acres_homes, 3), "\n")
## Correlation between WUI acres and homes damaged: 0.173
# Test significance
cor_test_wui <- cor.test(county_data$acres_wui, county_data$homes_damaged)
print(cor_test_wui)
## 
##  Pearson's product-moment correlation
## 
## data:  county_data$acres_wui and county_data$homes_damaged
## t = 1.3144, df = 56, p-value = 0.1941
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08928845  0.41284515
## sample estimates:
##       cor 
## 0.1729971

SOLUTION:

The correlation is approximately 0.17 — a very weak positive relationship.

This surprisingly weak correlation suggests that simply having more WUI area doesn’t directly predict home damage (>50% structural damage). Other factors matter more: fire behavior, evacuation success, firefighting resources, and building materials.


5.5 Question 36: Home Damage Rate Scatter Plot

Create a scatter plot with prop_wui on x-axis and homes_damaged per 1000 residents (calculate this) on y-axis. Add a regression line. What does the slope tell us?

# Calculate damage rate per 1000 residents
county_data <- county_data %>%
  mutate(damage_per_1000 = (homes_damaged / population) * 1000)

# Fit linear model
lm_fit <- lm(damage_per_1000 ~ prop_wui, data = county_data)
summary(lm_fit)
## 
## Call:
## lm(formula = damage_per_1000 ~ prop_wui, data = county_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.282 -2.929 -2.774 -1.732 64.051 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.7815     2.0029   1.389     0.17
## prop_wui      0.8745     6.3161   0.138     0.89
## 
## Residual standard error: 9.416 on 56 degrees of freedom
## Multiple R-squared:  0.0003422,  Adjusted R-squared:  -0.01751 
## F-statistic: 0.01917 on 1 and 56 DF,  p-value: 0.8904
# Create scatter plot
ggplot(county_data, aes(x = prop_wui, y = damage_per_1000)) +
  geom_point(aes(color = region), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#2C3E50", linetype = "dashed") +
  geom_label_repel(data = county_data %>% arrange(desc(damage_per_1000)) %>% head(5),
                   aes(label = county),
                   size = 3,
                   max.overlaps = 15) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_color_brewer(palette = "Set3", name = "Region") +
  labs(
    title = "WUI Exposure and Home Damage Rate by County",
    subtitle = "Homes with >50% damage per 1,000 residents | Higher WUI → Higher damage rate",
    x = "Proportion of County in WUI",
    y = "Homes Damaged (>50%) per 1,000 Residents",
    caption = "Source: CALFIRE Data 2019-2023"
  ) +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 8)) +
  guides(color = guide_legend(nrow = 2))

SOLUTION:

The regression slope is approximately 0.9, meaning:

Interpretation: For every 10 percentage point increase in WUI proportion, we expect approximately 0.09 more homes with >50% damage per 1,000 residents.

Key insight: This relationship is not statistically significant (p ≈ 0.89), and the model explains virtually none of the variation in per-capita wildfire damage.


5.6 Question 37: Ridge Plot - Fires by Region

Using mutate(), create a new variable fires_total = fires_human + fires_natural + fires_other. Then create a density ridge plot showing the distribution of total fires by region.

# Create total fires variable
county_data <- county_data %>%
  mutate(fires_total = fires_human + fires_natural + fires_other)

ggplot(county_data, aes(x = fires_total, y = reorder(region, fires_total, FUN = median), 
                        fill = stat(x))) +
  geom_density_ridges_gradient(scale = 2, rel_min_height = 0.01, gradient_lwd = 0.5) +
  scale_fill_viridis(option = "inferno", name = "Total Fires") +
  scale_x_continuous(expand = c(0, 0)) +
  labs(
    title = "Distribution of Total Fires by California Region",
    subtitle = "Density ridges show how fire counts vary within each region",
    x = "Total Number of Fires per County",
    y = NULL,
    caption = "Source: CALFIRE Data 2019-2023"
  ) +
  theme(legend.position = "right")

Interpretation: Wider distributions indicate greater county-to-county variability in wildfire activity, while narrow distributions reflect more homogeneous fire patterns within regions.

SOLUTION:

Patterns from the ridge plot:

  1. Superior California and North Coast have wide, right-skewed distributions — a few counties have many more fires than others
  2. Southern San Joaquin Valley shows a bimodal distribution, indicating substantial variability in fire counts across counties.
  3. Los Angeles County, Orange County, Inland Empire, and San Diego–Imperial do not appear because they contain too few counties to estimate a density distribution.

Limitations of this visualization:

  1. Regions have different numbers of counties — comparing density shapes across unequal sample sizes is misleading.
  2. Density estimation requires assumptions — the smoothness of these curves is a modeling choice, not raw data
  3. We’re counting fires, not measuring impact — a region with many small fires may look “worse” than one with few catastrophic fires

When to use ridge plots vs. boxplots: - Ridge plots: Large samples, interest in distribution shape (bimodality, skewness) - Boxplots: Smaller samples, interest in comparing medians and outliers


5.7 Question 38: Elderly Population and Home Damage

The elderly population may be particularly vulnerable. Create a scatter plot of prop_65_older vs. homes_damaged. Is there evidence that counties with older populations experience more home damage?

# Calculate correlation
cor_elderly <- cor(county_data$prop_65_older, county_data$homes_damaged, use = "complete.obs")
cat("Correlation between elderly proportion and homes damaged:", round(cor_elderly, 3), "\n")
## Correlation between elderly proportion and homes damaged: 0.088
ggplot(county_data, aes(x = prop_65_older, y = homes_damaged)) +
  geom_point(aes(color = drought_level), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#2C3E50", linetype = "dashed") +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) +
  scale_color_manual(values = c("Abnormally Dry" = "#FFEDA0", 
                                "Moderate" = "#FEB24C", 
                                "Severe" = "#F03B20"),
                     name = "Drought Level") +
  labs(
    title = "Elderly Population and Wildfire Home Damage",
    subtitle = paste0("Correlation: r = ", round(cor_elderly, 3), " | Weak positive relationship"),
    x = "Proportion of Population Age 65+",
    y = "Homes Damaged or Destroyed (>50% damage)",
    caption = "Source: CALFIRE Data 2019-2023"
  )

SOLUTION:

The correlation is approximately 0.09 — a weak positive relationship.

Interpretation:

  • Counties with higher proportions of elderly residents show slightly more home damage (>50% structural damage)
  • However, this may be confounded by geography: rural counties tend to have older populations AND more fire exposure
  • The relationship does not establish causation — elderly residents may not cause more damage, but they may live in more fire-prone areas

5.8 Question 39: Small Multiples - Drought Effect on Fire-Damage Relationship

Create a small multiples visualization (facet_wrap) showing the relationship between acres burned and homes damaged, faceted by drought level. Does drought severity change this relationship?

ggplot(county_data, aes(x = acres_burned, y = homes_damaged)) +
  geom_point(aes(color = region), size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "#2C3E50", linewidth = 0.8) +
  facet_wrap(~drought_level, scales = "free") +
  scale_x_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Set3", guide = "none") +
  labs(
    title = "Acres Burned vs. Homes Damaged by Drought Level",
    subtitle = "Faceted by drought severity | Each panel shows different scale",
    x = "Acres Burned (thousands)",
    y = "Homes Damaged (>50% damage)",
    caption = "Source: CALFIRE Data 2019-2023"
  )

SOLUTION:

Observations by drought level:

  • Abnormally Dry: Weakest relationship between acres and homes
  • Moderate: Moderate positive relationship
  • Severe: Strongest positive relationship — in severe drought, more acres burned translates more directly to home damage

Key insight: Drought severity appears to strengthen the relationship between fire size and home damage. This may be because during severe drought, fires burn more intensely and are harder to control, increasing structure damage per acre burned.


6 Interactive Visualization (Questions 40-41)

6.1 Introduction to Interactive Visualization with plotly

Before creating interactive plots, let us understand what makes them valuable and how to build them in R.

6.1.1 What is Interactive Visualization?

Static plots (like those created with ggplot2) are fixed images. Once rendered, users cannot explore the data further without modifying and re-running the code. Interactive visualizations, in contrast, allow users to explore data dynamically through direct manipulation.

Common interactive features include:

Action Purpose Example Use Case
Hover View exact values for individual data points Identifying which year had 73,235 acres burned
Zoom Focus on specific regions of interest Examining a cluster of high-damage counties
Pan Navigate across large datasets Scrolling through a 24-year timeline
Filter Show or hide data categories Displaying only “Severe” drought years
Download Save the current view as an image Exporting a zoomed region for a report

6.1.2 Why Use Interactive Visualization?

Interactive plots are particularly useful in the following scenarios:

  1. Dense data: When datasets contain many observations, labels would clutter a static plot. Interactivity lets users reveal information on demand.

  2. Precise values needed: When exact numbers matter (e.g., “How many homes were damaged in Butte County?”), hovering provides immediate answers without cluttering the display.

  3. Exploratory analysis: When you need to examine different subsets or identify outliers, interaction is faster than repeatedly modifying code.

6.1.3 The plotly Package in R

The plotly package provides two approaches for creating interactive plots:

Approach 1: Convert existing ggplot2 objects

This approach leverages your existing ggplot2 knowledge. Create a plot as usual, then wrap it with ggplotly():

# Create a standard ggplot
p <- ggplot(data, aes(x = year, y = value)) +
  geom_point() +
  geom_line()

# Convert to interactive
ggplotly(p)

Approach 2: Build directly with plot_ly()

This approach uses plotly’s native syntax, which offers more customization but requires learning new functions:

plot_ly(data, x = ~year, y = ~value, type = "scatter", mode = "lines+markers")

In this lab, we use Approach 1 because you already know ggplot2 syntax. The ggplotly() function handles the conversion automatically.

6.1.4 Creating Custom Tooltips

By default, ggplotly() displays the aesthetic mappings (x, y, color, etc.) when hovering. For more informative tooltips, use the text aesthetic:

# Add custom tooltip text using the text aesthetic
p <- ggplot(statewide, aes(x = year, y = acres_burned,
                           text = paste0("Year: ", year,
                                        "<br>Acres: ", comma(acres_burned),
                                        "<br>Fires: ", fire_count))) +
  geom_point()

# Tell ggplotly to use only the custom text
ggplotly(p, tooltip = "text")

Key syntax notes:

  • Use paste0() to concatenate text without spaces
  • Use <br> for line breaks in the tooltip (HTML syntax)
  • The tooltip = "text" parameter tells plotly to use your custom text instead of default aesthetics

6.1.5 Customizing the Layout

After converting to plotly, you can further customize with layout():

ggplotly(p, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
    legend = list(orientation = "h", y = -0.15)
  )

Now let us apply these concepts to create interactive wildfire visualizations.


6.2 Question 40: Interactive Time Series

Using plotly, convert your time series of acres burned into an interactive plot where users can hover to see exact values for each year. Include tooltips showing year, acres burned, and fire count.

Instructions: Fill in the blanks (_____) to complete the code. Use the concepts from the introduction above.

# STEP 1: Create ggplot with custom tooltip text
# Hint: The text aesthetic creates hover information
p_timeseries <- ggplot(statewide, aes(x = year, y = acres_burned, 
                                      text = paste0("Year: ", _____,
                                                   "<br>Acres Burned: ", comma(round(_____)),
                                                   "<br>Fire Count: ", _____,
                                                   "<br>Drought: ", _____))) +
  
  # STEP 2: Add geometries (provided)
  geom_line(color = "#D32F2F", linewidth = 1) +
  geom_point(aes(color = drought_level), size = 3) +
  scale_color_manual(values = drought_colors, name = "Drought Level") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Interactive: California Wildfire Severity (2001-2024)",
    subtitle = "Hover over points for detailed information",
    x = "Year",
    y = "Acres Burned"
  )

# STEP 3: Convert to plotly
# Hint: Use tooltip = "text" to display your custom text
ggplotly(p_timeseries, tooltip = "_____") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
    legend = list(orientation = "h", y = -0.15)
  )
Click for Solution
p_timeseries <- ggplot(statewide, aes(x = year, y = acres_burned, 
                                      text = paste0("Year: ", year,
                                                   "<br>Acres Burned: ", comma(round(acres_burned)),
                                                   "<br>Fire Count: ", fire_count,
                                                   "<br>Drought: ", drought_level))) +
  geom_line(color = "#D32F2F", linewidth = 1) +
  geom_point(aes(color = drought_level), size = 3) +
  scale_color_manual(values = drought_colors, name = "Drought Level") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Interactive: California Wildfire Severity (2001-2024)",
    subtitle = "Hover over points for detailed information",
    x = "Year",
    y = "Acres Burned"
  )

ggplotly(p_timeseries, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 12)),
    legend = list(orientation = "h", y = -0.15)
  )

SOLUTION:

The interactive plot allows users to:

  1. Hover over any point to see exact year, acres burned, fire count, and drought level
  2. Zoom into specific time periods
  3. Pan across the timeline
  4. Download the plot as a PNG image

6.3 Question 41: Interactive County Scatter Plot

Create an interactive scatter plot using plotly showing all 58 counties with prop_wui vs. homes_damaged. When users hover over a point, display the county name, region, and population.

Instructions: Fill in the blanks to create the interactive scatter plot.

# STEP 1: Create ggplot with custom tooltip
# Hint: Include county, region, population, WUI proportion, and homes damaged in tooltip
p_scatter <- ggplot(county_data, aes(x = prop_wui, y = homes_damaged,
                                     text = paste0("County: ", _____,
                                                  "<br>Region: ", _____,
                                                  "<br>Population: ", comma(_____),
                                                  "<br>WUI: ", percent(prop_wui, accuracy = 0.1),
                                                  "<br>Homes Damaged: ", comma(_____)))) +
  
  # STEP 2: Add points with size mapped to population (provided)
  geom_point(aes(color = region, size = population), alpha = 0.7) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) +
  scale_size_continuous(range = c(3, 12), guide = "none") +
  scale_color_brewer(palette = "Set3", name = "Region") +
  labs(
    title = "Interactive: WUI Exposure and Home Damage by County",
    subtitle = "Hover over points for county details | Bubble size = population",
    x = "Proportion of County in WUI",
    y = "Homes Damaged or Destroyed (>50% damage)"
  )

# STEP 3: Convert to plotly with custom tooltip
# Hint: Use tooltip = "text" to show your custom hover text
ggplotly(p_scatter, tooltip = "_____") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 11)),
    legend = list(orientation = "v", x = 1.02, y = 0.5)
  )
Click for Solution
p_scatter <- ggplot(county_data, aes(x = prop_wui, y = homes_damaged,
                                     text = paste0("County: ", county,
                                                  "<br>Region: ", region,
                                                  "<br>Population: ", comma(population),
                                                  "<br>WUI: ", percent(prop_wui, accuracy = 0.1),
                                                  "<br>Homes Damaged: ", comma(homes_damaged)))) +
  geom_point(aes(color = region, size = population), alpha = 0.7) +
  scale_x_continuous(labels = percent_format(accuracy = 1)) +
  scale_y_continuous(labels = comma) +
  scale_size_continuous(range = c(3, 12), guide = "none") +
  scale_color_brewer(palette = "Set3", name = "Region") +
  labs(
    title = "Interactive: WUI Exposure and Home Damage by County",
    subtitle = "Hover over points for county details | Bubble size = population",
    x = "Proportion of County in WUI",
    y = "Homes Damaged or Destroyed (>50% damage)"
  )

ggplotly(p_scatter, tooltip = "text") %>%
  layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 11)),
    legend = list(orientation = "v", x = 1.02, y = 0.5)
  )

SOLUTION:

The interactive scatter plot reveals:

  1. Outliers: Hovering over the highest points identifies Butte, Santa Cruz, Napa, Plumas, and Sonoma as most-damaged counties
  2. Regional patterns: Color clustering shows geographic patterns in fire damage
  3. Population context: Bubble size provides instant population context without cluttering the plot

7 Synthesizing Evidence & Critical Thinking (Questions 42-49)

7.1 Question 42: Evaluating the “Larger Fires” Claim

The TIME article claims wildfires are becoming “more frequent, larger, and happen in a longer fire season.” Based on your analysis of fire count and acres burned trends, does the data support the claim that fires are becoming “larger”? Cite specific statistics.

SOLUTION:

The data supports the claim that fires are becoming larger:

Evidence:

  1. Mean acres burned increased 86%: From 12,141 acres (2001-2012) to 22,531 acres (2013-2024)

  2. Extreme fire years are recent: The top 5 worst fire years by acres burned (2020, 2021, 2018, 2017) all occurred in the second half of the dataset

  3. Maximum severity tripled: The worst year in 2001-2012 was 26,589 acres (2008); the worst in 2013-2024 was 73,235 acres (2020)

  4. Correlation with temperature: The moderate positive correlation (r ≈ 0.49) between temperature and acres burned supports the climate connection

However, “more frequent” is less clearly supported:

  • Fire count shows high variability but no consistent upward trend
  • The correlation between year and fire count is weaker than with acres burned

Conclusion: The claim that fires are becoming “larger” is well-supported by the data. The claim about frequency is less definitive.


7.2 Question 43: Summary Visualization

Create a summary visualization showing the 5 most important findings from this lab. You may choose the plot type(s) - be creative but ensure clarity.

# Create multiple summary plots
# Plot 1: Trend in fire severity
p1 <- ggplot(statewide, aes(x = year, y = acres_burned)) +
  geom_col(fill = "#D32F2F", alpha = 0.8) +
  geom_smooth(method = "loess", se = FALSE, color = "#1976D2", linewidth = 1.5) +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  labs(title = "1. Fire Severity Increasing", 
       subtitle = "Acres burned trending upward",
       x = "Year", y = "Acres Burned (K)") +
  theme(plot.title = element_text(size = 12))

# Plot 2: Temperature correlation
p2 <- ggplot(statewide, aes(x = avg_temp_f, y = acres_burned)) +
  geom_point(size = 3, color = "#D32F2F", alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "#1976D2") +
  scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  labs(title = "2. Temperature Drives Fire Size", 
       subtitle = paste0("Correlation r = ", round(cor(statewide$avg_temp_f, statewide$acres_burned), 2)),
       x = "Avg Temperature (°F)", y = "Acres Burned (K)") +
  theme(plot.title = element_text(size = 12))

# Plot 3: Top counties
top5_summary <- county_data %>%
  arrange(desc(homes_damaged)) %>%
  head(5)

p3 <- ggplot(top5_summary, aes(x = reorder(county, homes_damaged), y = homes_damaged)) +
  geom_col(fill = "#8B0000", alpha = 0.8) +
  geom_text(aes(label = comma(homes_damaged)), hjust = -0.1, size = 3.5) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "3. Butte County Most Affected", 
       subtitle = "Top 5 by homes damaged (>50%)",
       x = NULL, y = "Homes Damaged") +
  theme(plot.title = element_text(size = 12))

# Plot 4: WUI exposure
p4 <- ggplot(county_data, aes(x = prop_wui, y = damage_per_1000)) +
  geom_point(color = "#D32F2F", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "#1976D2") +
  scale_x_continuous(labels = percent_format()) +
  labs(title = "4. WUI Increases Risk", 
       subtitle = "More WUI = More damage per capita",
       x = "% County in WUI", y = "Homes/1000 Residents") +
  theme(plot.title = element_text(size = 12))

# Plot 5: Carbon emissions
p5 <- ggplot(statewide, aes(x = year, y = carbon_emissions)) +
  geom_area(fill = "#2E7D32", alpha = 0.7) +
  geom_line(color = "#1B5E20", linewidth = 1) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  labs(title = "5. Carbon Footprint Growing", 
       subtitle = "Tree loss emissions peaked in 2021",
       x = "Year", y = "Carbon Emissions (M Mg)") +
  theme(plot.title = element_text(size = 12))

# Combine with patchwork
combined <- (p1 | p2) / (p3 | p4 | p5) +
  plot_annotation(
    title = "Key Findings: California Wildfire Analysis (2001-2024)",
    subtitle = "Five critical insights from the data",
    caption = "Source: CALFIRE Data | STAT 1014 Lab",
    theme = theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle = element_text(size = 12, color = "gray40")
    )
  )

print(combined)

SOLUTION:

The summary visualization highlights five key findings:

  1. Fire severity is increasing — clear upward trend in acres burned
  2. Temperature correlates with fire size — supporting the climate change connection
  3. Butte County was most affected — with 2,834 homes with >50% damage
  4. WUI exposure increases per-capita risk — positive slope in damage rate
  5. Carbon emissions are growing — environmental impact from tree cover loss accelerating

7.3 Question 44: Drought Severity Calculation

The article states the southwest US is experiencing “the driest 22-year period in the last 1,200 years.” Using our data, calculate the percentage of years (2001-2024) that experienced “Severe,” “Extreme,” or “Exceptional” drought.

# Count severe drought years
severe_drought <- statewide %>%
  filter(drought_level %in% c("Severe", "Extreme", "Exceptional")) %>%
  nrow()

total_years <- nrow(statewide)
pct_severe <- (severe_drought / total_years) * 100

cat("Years with Severe/Extreme/Exceptional drought:", severe_drought, "out of", total_years, "\n")
## Years with Severe/Extreme/Exceptional drought: 8 out of 24
cat("Percentage:", round(pct_severe, 1), "%\n")
## Percentage: 33.3 %
# Breakdown
drought_breakdown <- statewide %>%
  group_by(drought_level) %>%
  summarise(count = n()) %>%
  mutate(percentage = round(count / sum(count) * 100, 1))

print(drought_breakdown)
## # A tibble: 5 × 3
##   drought_level  count percentage
##   <fct>          <int>      <dbl>
## 1 Abnormally Dry    10       41.7
## 2 Moderate           6       25  
## 3 Severe             3       12.5
## 4 Extreme            3       12.5
## 5 Exceptional        2        8.3

SOLUTION:

33.3% of years (8 out of 24) experienced Severe, Extreme, or Exceptional drought:

  • Severe: 3 years (12.5%)
  • Extreme: 3 years (12.5%)
  • Exceptional: 2 years (8.3%)

This aligns with the TIME article’s characterization of persistent drought conditions in the region.


7.4 Question 45: San Francisco Statistical Analysis

San Francisco has 0 recorded wildfires and 0 homes damaged despite having over 836,000 residents. Using what you’ve learned about the 10-acre reporting threshold and urban density, explain why this makes statistical sense. What does this tell us about the nature of wildfire data?

SOLUTION:

San Francisco’s zero wildfire records make statistical sense for several interconnected reasons:

Physical/Geographic Factors:

  1. Extreme urban density: ~17,000 people per square mile
  2. Minimal wildland: Only 847 acres of WUI (smallest in CA)
  3. Water boundaries: Surrounded by ocean and bay on three sides
  4. Fire infrastructure: Dense network of hydrants, stations, and response teams

Data Collection Factors:

  1. 10-acre threshold: Only fires growing to 10+ acres are recorded
  2. Urban suppression: Any fire would be contained before reaching 10 acres
  3. Reporting scope: Data tracks wildland fires, not structural fires

What this teaches us about wildfire data:

  1. Zero ≠ No Risk: Absence of records doesn’t mean absence of fire danger
  2. Threshold effects: Data collection methods shape what we observe
  3. Urban-rural distinction: Wildfire data primarily captures non-urban areas
  4. Selection bias: Our dataset systematically excludes small fires and urban fires

Statistical principle: Always understand how data is collected before interpreting zeros or absences.


7.5 Question 46: Regional Carbon Emissions Comparison

Create a final summary boxplot comparing carbon emissions across the 10 California regions. Which regions contribute most to tree cover loss-related emissions? Why might this be?

# Calculate regional carbon emissions
regional_carbon <- county_data %>%
  group_by(region) %>%
  summarise(
    total_carbon = sum(carbon_emissions, na.rm = TRUE),
    median_carbon = median(carbon_emissions, na.rm = TRUE),
    n_counties = n()
  ) %>%
  arrange(desc(total_carbon))

print(regional_carbon)
## # A tibble: 10 × 4
##    region                      total_carbon median_carbon n_counties
##    <chr>                              <int>         <dbl>      <int>
##  1 Superior California            458937304     14333338.         18
##  2 North Coast                     89178852     13418654.          6
##  3 Southern San Joaquin Valley     41421704      1734585           5
##  4 Northern San Joaquin Valley     34461444      1915616.         10
##  5 Central Coast                   15555659       250186.          6
##  6 San Francisco Bay Area           4496900       420294.          6
##  7 Inland Empire                    2128732      1064366           2
##  8 Los Angeles County               1824944       912472           2
##  9 Orange County                     123506       123506           1
## 10 San Diego-Imperial                122853        61426.          2
# Create boxplot
ggplot(county_data, aes(x = reorder(region, carbon_emissions, FUN = median), 
                        y = carbon_emissions, fill = region)) +
  geom_boxplot(alpha = 0.8, outlier.shape = 21, show.legend = FALSE) +
  coord_flip() +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Carbon Emissions from Tree Cover Loss by California Region",
    subtitle = "Superior California contributes most to emissions from tree cover loss",
    x = NULL,
    y = "Carbon Emissions (Million Mg)",
    caption = "Source: Global Forest Watch Data 2019-2023 | Includes all drivers of tree cover loss"
  )

SOLUTION:

Superior California contributes the most to tree cover loss-related carbon emissions, followed by the North Coast and Northern San Joaquin Valley.

Why these regions?

  1. Forest density: These regions contain California’s largest forested areas
  2. Fire severity: Large, uncontrolled fires burn more biomass
  3. Vegetation type: Old-growth forests store more carbon than grasslands
  4. Lower population: Fewer resources for early suppression
  5. Other drivers: These regions may also experience logging and natural disturbances that contribute to tree cover loss

Important note: Carbon emissions include all tree cover loss (wildfires, logging, natural disturbances), not just fire-related loss.

Policy implication: Climate mitigation efforts should focus on fire prevention and sustainable forest management in heavily forested regions.


7.6 Question 47: Policy Recommendations

The TIME article concludes: “Living with wildfire is how humans manage risk.” Based on your analysis, propose 2-3 data-driven recommendations for how California residents and policymakers might better manage wildfire risk.

SOLUTION:

Based on the data analysis, here are three potential strategies worth investigating further:

1. Target WUI Development and Building Codes

  • Evidence: Positive correlation between WUI proportion and per-capita home damage (>50% structural damage)
  • Recommendation: Strengthen building codes in WUI zones requiring fire-resistant materials, create buffer zones between development and wildlands, and update zoning laws to limit new development in high-risk areas

2. Focus Resources on High-Impact Counties

  • Evidence: 5 counties (Butte, Santa Cruz, Napa, Plumas, El Dorado) account for disproportionate home losses
  • Recommendation: Prioritize fire prevention investment, evacuation planning, and insurance assistance in these high-impact areas rather than spreading resources evenly

3. Prepare for Climate Variability, Not Just Trends

  • Evidence: High coefficient of variation (95%) in acres burned; “moderate” drought years often experience worst fires due to fuel buildup
  • Recommendation: Move beyond static risk assessments to dynamic models that account for “climate whiplash” — wet years that produce fuel followed by dry years that ignite it.

Important caveat: These are hypotheses generated from observational data, not proven solutions. Each would require further study, stakeholder input, and careful consideration of unintended consequences before implementation.


7.7 Question 48: Awareness of Wildfire Risk

Reflecting on the data and article, do you think there is a “lack of awareness about how much most people living in the west are living in areas prone to wildfire”? What data from this lab supports your position?

SOLUTION:

This is a genuinely difficult question, and the data supports multiple interpretations:

Evidence suggesting lack of awareness:

  1. Widespread WUI exposure: Many California counties have 40-80% of their land classified as WUI, yet development continues

  2. Income doesn’t predict caution: The weak correlation between income and fire damage suggests wealthy communities aren’t necessarily avoiding fire-prone areas

  3. Populated areas at risk: Los Angeles County has over 928,000 acres of WUI with nearly 10 million residents

  4. Historical amnesia: The most destructive fire years are recent (2017-2021), but development patterns haven’t dramatically shifted

Evidence suggesting awareness exists but other factors dominate:

  • People may be aware but accept risk for lifestyle benefits (views, space, cost)

  • Insurance market changes suggest increasing awareness

  • Housing affordability may force people into fire-prone areas regardless of awareness

What the data cannot tell us:

  • What people actually know about fire risk (we’d need surveys)

  • Whether people who move to WUI areas understand the risk vs. those who were already there

Conclusion: The continued development in WUI areas, combined with the socioeconomic data showing damage across all income levels, suggests either lack of awareness OR a collective underestimation of wildfire risk.


7.8 Question 49: Reflection

What was the most surprising finding from your analysis? What additional data would you want to collect to better understand California Wildfire risk?

SOLUTION:

Most surprising finding: (Student answers will vary, but instructor examples include:)

  1. The “Moderate” drought level showing the highest median acres burned — counter to expectations that extreme drought = extreme fires

  2. The weak correlation between WUI area and home damage — suggesting fire location and intensity matter more than simple exposure metrics

  3. San Francisco’s complete absence from the data despite being in a fire-prone state

Additional data that would improve analysis:

  1. Temporal resolution: Monthly or weekly data to analyze fire seasons
  2. Fire-level data: Individual fire records with start dates, causes, and containment times
  3. Building characteristics: Age, construction materials, and defensible space
  4. Insurance data: Claims, coverage rates, and policy cancellations
  5. Evacuation data: Warning times, compliance rates, and casualties
  6. Historical data: Longer time series (50+ years) to assess climate trends
  7. Wind data: Santa Ana wind events correlated with fire starts
  8. Vegetation indices: Real-time fuel moisture measurements

8 Conclusion

This lab has demonstrated how statistical analysis can illuminate complex environmental issues. By examining California Wildfire data from 2001-2024, we found evidence supporting several claims from the TIME Magazine article:

  • Fires are becoming larger — mean acres burned increased 89% in the second half of the study period
  • Climate factors matter — temperature correlates positively with fire severity
  • WUI exposure increases risk — communities in the wildland-urban interface face higher per-capita damage (homes with >50% structural damage)
  • Human factors are complex — fire causes, development patterns, and response capabilities all influence outcomes

Most importantly, we learned that “living with wildfire” requires understanding the data: where fires occur, what drives their severity, and how risk varies across communities. As climate change continues to affect fire patterns, data-driven decision-making will be essential for managing this ongoing challenge.


8.1 Additional Resources

Explore More: Interactive Climate Simulations

To better understand how temperature shifts impact the broader world, explore these interactive resources:

Resource Description Link
En-ROADS (Climate Interactive & MIT Sloan) Test global policy and energy scenarios to see their impact on temperature https://en-roads.climateinteractive.org
Climate Time Machine (NASA) Visualize how temperature, sea ice, and CO₂ have changed over time https://climate.nasa.gov/interactives/climate-time-machine
Sea Level Rise Viewer (Climate Central) Explore how warming scenarios affect coastal flooding https://coastal.climatecentral.org
1.5°C vs 2°C Impacts (Carbon Brief) Compare consequences of different warming levels across sectors https://interactive.carbonbrief.org/impacts-climate-change-one-point-five-degrees-two-degrees

These tools help contextualize why a “small” 1.38°F change in California connects to larger global climate dynamics.


8.2 Data Sources

Source Description URL
CAL FIRE Historical Wildfire Incidents by CA Census Tract https://www.fire.ca.gov/incidents
NOAA NCEI Historical Temperature by CA County https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/
The National Drought Mitigation Center Historical Drought by CA County https://droughtmonitor.unl.edu/DmData/DataDownload.aspx
CAL FIRE DINS Wildfire Housing Damage in California https://www.fire.ca.gov/what-we-do/fire-protection/damage-inspection
Global Forest Watch Tree Cover Loss and Carbon Emissions https://www.globalforestwatch.org/dashboards/country/USA/5
U.S. Census Bureau American Community Survey 5-Year Estimates https://data.census.gov
CAL FIRE Wildland-Urban Interface https://data.ca.gov/dataset/wildland-urban-interface

Note on Carbon Emissions Data: The carbon emissions and tree cover loss data from Global Forest Watch includes all drivers of tree cover loss, not just wildfires. This includes emissions from wildfires, logging and forestry operations, agricultural conversion, natural disturbances (storms, insects, flooding), and urban development. While wildfires are a major contributor in California, these figures represent total forest carbon impact.


8.3 Reference:

  • Abatzoglou, J. T., & Williams, A. P. (2016). Impact of anthropogenic climate change on wildfire across western US forests. Proceedings of the National Academy of Sciences, 113(42), 11770-11775. https://doi.org/10.1073/pnas.1607171113